if (!requireNamespace("GSA", quietly = TRUE)){
install.packages("GSA")}
if (!requireNamespace("kable", quietly = TRUE)){
install.packages("kable")}
My dataset from GEO (Gene Expression Omnibus) is GSE131222 of Transcriptional control of sub-type switching ensures adaptation and growth of pancreatic cancer. My gene set aims to see the underlying ectopic expression of GLI2 expression in human pancreatic cancer cell lines specifically looking at YAPC cells. Moreover, 1 ug/ml of Dox treatment allowed for over expression analysis of GLI2 by measuring via RNA-seq. on Illumina hiseq 2000. The authors used TopHat and Cufflinks for quality control and later used qRT-PCR along with SYBR Green assays to measure the intensity. By concluding oncogene genes and detecting induced-higher expression values compare to expression values 6 days before, GLI2 regulation of genes associated with basal-like sub-type switching was detected
Firstly, the data set I obtained from GEO was cleaned and filtered based on duplication. The normalized values of genes that are mostly miRNA genes are eliminated from the gene set. There have been instances that one row of expression has multiple genes associated. This was later found out on the second assignment where we believe this did not cause any problem, due to our limited analysis in the first assignment. Later I analyzed the dispersion among treatments of both triplicate controls that contain empty plasmid of GLI2 and experiments that contain inducible GLI2 plasmid! Since there might be a discrepancy among the gene annotation, which can cause a misunderstanding of our later analysis, I validated my HUGO gene symbols. Furthermore, by identifying HUGO symbols that are respective or similar to those that our clean data set contains. I was able to conclude that my data set is clean, contains no duplicates as well as ready for the later analyses which are normalization, differential expression calculation, non-thresholded & thresholded gene enrichment analyses, fold expression based on the fitted model of GLI2 expression set, and finally for the enrichment mapping with post analysis.
Secondly, by preliminary analyzing the differential expression of our data set via heat map I have concluded that there is a strong difference between the treatment of GLI2 and the control cells in terms of their summarized gene expression by the model. Later, by visualizing the p-value distribution I concluded most of the oncogenes referred on the main article are clustered at the specified p-value interval. Later sample clustering summarized the gene expression difference between the treatment and the control genes. Stringently analyzing Lastly, the non-thresholded gene enrichment map both allowed me to observe genes that are passing our permissive FDR values as well as defined the pathways that are found to be associated with oncogenes. These pathways are then altered visualized and used to conclude that by doing all the analysis on assignment 2, I was able to differentiate the genes that are related to sub-type switching (EMT) and those that are not (non-EMT). Finally, I have separated two data from each other and obtained a ranked gene list based additionally. These analyses taught me how to select a data set, clean a data set, normalize it, analyze to what extent the genes are expressed differentially from one another, and see how the genes are interacting genes in a given pathway or a given gene set. Also, finally, validate the author’s conclusion or come up with my conclusion and promote further research in the area. In this final assignment of mine, I will be concluding my analyses with enrichment mapping by using Cytoscape and its apps as well as conservatively gene set.
knitr::include_graphics("/Users/mac/Desktop/BCB420/A3_MetyuMelkonyan/images/GSEA.png")
Figure 1 GSEA running on Pre-ranked Gene List Panel
I have used GSEA (4.3.2) (4) non-thresholded gene set enrichment method. I have used the Bader Lab gene set of Human_GOBP_AllPathways_no_GO_iea_March_01_2021_symbol.gmt including no HUGO symbols and IEA symbols published on March 1, 2021. It was released in 2023 by Human MiSigDB. I used no collapse to remap gene symbols to avoid confusion. 200 max gene size and 15 min gene size to have more stringent values. I used 1000 permutations that allowed for a shorter running time. Therefore, GSEA resulted in more stringent non-EMT values that are shown to be enriched. There are pathways of genes that are known to be involved in apoptosis. This makes sense because, with my previous results that my log CPM is always negative, there are no positive log CPM values. This was my previous hypothesis that almost every gene treated showed higher expression with the treatments the authors did to induce oncogenic roles.
knitr::include_graphics("/Users/mac/Desktop/BCB420/A3_MetyuMelkonyan/images/enrichment.png")
Figure 2 GSEA Enrichment Results: GSE results indicate there are 4703 upregulated gene sets and 38 down regulated gene sets.
knitr::include_graphics("/Users/mac/Desktop/BCB420/A3_MetyuMelkonyan/images/neg.png")
Figure 3 Negative Enrichment Results: The top gene set for negative is PHOSPHOLIPASES%HUMANCYC%LIPASYN-PWY with 0 FDR and 0 p-value and-0.65 ES.
knitr::include_graphics("/Users/mac/Desktop/BCB420/A3_MetyuMelkonyan/images/pos.png")
Figure 4 Positive Enrichment Results: The top gene set at the positive gene set is TRANSCRIPTIONAL REGULATION BY SMALL RNAS%REACTOME%R-HSA-5578749.3 with 0.784 FDR-value and 0.001 Nominal p-values and 0.98 ES score.
There are pathways of genes that are known to be involved in apoptosis. This makes sense because, with my previous results that my log CPM is always negative, there are no positive log CPM values. This was my previous hypothesis that almost every gene treated showed higher expression with the treatments the authors did to induce oncogenic roles.
Most of the pathways are viral infection pathways. Furthermore, some viruses stimulate pancreas cancer and liver cancer as well such as chronic Hepatitis C and chronic Hepatitis B (6) A significant proportion of these pathways are systematic disease-causing viruses such as Epstein Barr virus and HIV-1 virus. The negative results contain more oncogenic pathways such that the apoptosis and sialic acid synthesis gene sets are more prevalent as opposed to positive. There is the separation of oncogenic pathways between positive and negative. The second assignment gene enrichment results were more clear in terms of gene pathway role. The difference between EMT genes vs non-EMT genes is more apparent in the thresholded analysis. the gene enrichment results in the non-thresholded analysis are more scarce and do not give significantly enriched results. Even though my parameters are stringent, I am not getting pathways that are adequate in terms of their role.
Finally, the quantitative difference between the thresholded and the non-thresholded is the the thresholded analysis has 771 downregulated genes sets that are less than 0.001 p value and there are 1128 upregulated gene sets. The difference between non-threholded and thresholded is high as 4703 upregualted genes (non-thresholded) - 1128 upregulated genes (thresholded) = 3575 gene sets. The downregualted gene set difference is more substantial (733) genesets more in thresholded analysis.
knitr::include_graphics("/Users/mac/Desktop/BCB420/A3_MetyuMelkonyan/images/pathway.png")
Figure 5 Preliminary Cytoscape Pathway Enrichment Visualization: Both apoptotic and viral gene set are separated from each other. Mesenchymal apoptotic gene sets are apparent. Both upregualted (Red) and downregualted (Blue) gene sets are observed. This includes the whole enrichment map.
The enrichment map Cytoscape visualization was done by using stringent 0.05 p-value and ver permissive 0.5 FDR values (7) The reason why I used 0.5 for FDR is that. Furthermore, Jaccard + Overlap coefficient 0.375 and combined coefficient 0.5 were used. In total there are 291 nodes (including hidden) and 763 edges (including hidden edges) in the visualization. There are more gene sets that I find to be associated with each other.
knitr::include_graphics("/Users/mac/Desktop/BCB420/A3_MetyuMelkonyan/images/publication.png")
Figure 6 Publication Ready Annotation Figure of Gene Enrichment visualization (7): By annotating and analyzing in terms of positivity and negativity the enrichment visualization showed up regulated gene sets to be collapsed in viral export RNA theme. Most of the down regulated are not associated with one another.
knitr::include_graphics("/Users/mac/Desktop/BCB420/A3_MetyuMelkonyan/images/parameters.png")
Figure 7 Annotation parameters: The Default parameters are used due to small visualization and for proper observation. The annotations fit the model as there is strong evidence of the previous connections among cancer and viral genes indicated.
knitr::include_graphics("/Users/mac/Desktop/BCB420/A3_MetyuMelkonyan/images/cluster.png")
Figure 8 (Extra!)Publication ready figure of Gene Enrichment visualization: Collapsed themes indicate that there is indeed relationship between abortive viral RNA gene sets and the export viral RNA gene set.
Apoptotic gene sets are related to those of viral infection gene sets. This supports the presence of viral effects on pancreas cancer as well as many other cancer types. I created the publication-ready figure by using default parameters of MCL cluster 3 max word per label, minimum word occurrence of 1, and adjacent word bonus of 8 with WordCloud label algorithm. By collapsing the enrichment map I found 24 major themes on my map. I do not observe any novel pathways that appeared after total collapsing.
knitr::include_graphics("/Users/mac/Desktop/BCB420/A3_MetyuMelkonyan/images/sapiens.png")
Figure 9 (Extra!) GeneMania gene enrichment pathway results. Most of the genes are virus related(8)
The complete gene set GeneMania analysis to see if the genes in
the respective pathways on the enrichment map are associated. I noted
viral genes mostly. The noted node types
Genemania <- read.csv(file=file.path(getwd(),"data","table_genemania.csv"))
knitr::kable(Genemania, type="html")
| EnrichmentMap..Expression | gene.name | log.score | name | node.type | query.term | score | shared.name |
|---|---|---|---|---|---|---|---|
| 0.0000 | NUP50 | -0.4593206 | H__sapiens__1_-774733 | query | NUP50 | 0.6317127 | H__sapiens__1_-774733 |
| 0.0000 | NUP205 | -0.3864459 | H__sapiens__1_-782635 | query | NUP205 | 0.6794675 | H__sapiens__1_-782635 |
| 0.0000 | RANBP2 | -0.6944706 | H__sapiens__1_-782386 | query | RANBP2 | 0.4993387 | H__sapiens__1_-782386 |
| 0.0000 | TPR | -0.4513919 | H__sapiens__1_-773361 | query | TPR | 0.6367412 | H__sapiens__1_-773361 |
| 0.0000 | POM121 | -0.3202072 | H__sapiens__1_-789815 | query | POM121 | 0.7259986 | H__sapiens__1_-789815 |
| 0.0000 | NUP88 | -0.4133439 | H__sapiens__1_-776294 | query | NUP88 | 0.6614348 | H__sapiens__1_-776294 |
| 0.0000 | NUP188 | -0.4694928 | H__sapiens__1_-774764 | query | NUP188 | 0.6253193 | H__sapiens__1_-774764 |
| 0.0000 | NUP133 | -0.4971920 | H__sapiens__1_-773810 | query | NUP133 | 0.6082362 | H__sapiens__1_-773810 |
| 0.0000 | RAE1 | -0.5025908 | H__sapiens__1_-775191 | query | RAE1 | 0.6049613 | H__sapiens__1_-775191 |
| 35.8284 | NUP160 | -0.4609759 | H__sapiens__1_-773223 | query | NUP160 | 0.6306679 | H__sapiens__1_-773223 |
| 0.0000 | NUP155 | -0.4669676 | H__sapiens__1_-776946 | query | NUP155 | 0.6269004 | H__sapiens__1_-776946 |
| 0.0000 | NUP54 | -0.4274837 | H__sapiens__1_-780479 | query | NUP54 | 0.6521480 | H__sapiens__1_-780479 |
| 0.0000 | NUP153 | -0.5116202 | H__sapiens__1_-778335 | query | NUP153 | 0.5995235 | H__sapiens__1_-778335 |
| 0.0000 | NUP214 | -0.5148612 | H__sapiens__1_-778618 | query | NUP214 | 0.5975835 | H__sapiens__1_-778618 |
| 0.0000 | SEC13 | -0.5645876 | H__sapiens__1_-782804 | query | SEC13 | 0.5685946 | H__sapiens__1_-782804 |
| 0.0000 | NDC1 | -0.1134623 | H__sapiens__1_-773535 | query | NDC1 | 0.8927378 | H__sapiens__1_-773535 |
| 0.0000 | NUP37 | -0.4418337 | H__sapiens__1_-774031 | query | NUP37 | 0.6428565 | H__sapiens__1_-774031 |
| 0.0000 | AAAS | -0.3917891 | H__sapiens__1_-774750 | query | AAAS | 0.6758466 | H__sapiens__1_-774750 |
| 0.0000 | NUP85 | -0.4620475 | H__sapiens__1_-778398 | query | NUP85 | 0.6299924 | H__sapiens__1_-778398 |
| 0.0000 | NUP43 | -0.4713705 | H__sapiens__1_-777773 | query | NUP43 | 0.6241463 | H__sapiens__1_-777773 |
| 0.0000 | NUP107 | -0.5350951 | H__sapiens__1_-776652 | query | NUP107 | 0.5856136 | H__sapiens__1_-776652 |
| 0.0000 | NUP93 | -0.4014051 | H__sapiens__1_-775466 | query | NUP93 | 0.6693788 | H__sapiens__1_-775466 |
| 0.0000 | NUP210 | -0.3976326 | H__sapiens__1_-779292 | query | NUP210 | 0.6719089 | H__sapiens__1_-779292 |
| 0.0000 | NUP35 | -0.3966731 | H__sapiens__1_-783644 | query | NUP35 | 0.6725539 | H__sapiens__1_-783644 |
| 0.0000 | NUP62 | -0.4904854 | H__sapiens__1_-795021 | query | NUP62 | 0.6123291 | H__sapiens__1_-795021 |
| 0.0000 | POM121C | -0.0841121 | H__sapiens__1_-826631 | query | POM121C | 0.9193282 | H__sapiens__1_-826631 |
library(GSA)
gmt_file <- file.path(getwd(),"data",
"Human_GOBP_AllPathways_no_GO_iea_March_01_2021_symbol.gmt")
capture.output(genesets <-
GSA.read.gmt(gmt_file),file="GSA_file.txt")
names(genesets$genesets) <- genesets$geneset.names
knitr::kable(head(genesets$geneset.names), type="html")
| x |
|---|
| PROTEIN CITRULLINATION%HUMANCYC%PWY-4921 |
| METHYLGLYOXAL DEGRADATION III%HUMANCYC%PWY-5453 |
| STEARATE BIOSYNTHESIS I (ANIMALS)%HUMANCYC%PWY-5972 |
| TRYPTOPHAN DEGRADATION X (MAMMALIAN, VIA TRYPTAMINE)%HUMANCYC%PWY-6307 |
| CERAMIDE BIOSYNTHESIS%HUMANCYC%PWY3DJ-12 |
| L-SERINE DEGRADATION%HUMANCYC%SERDEG-PWY |
knitr::include_graphics("/Users/mac/Desktop/BCB420/A3_MetyuMelkonyan/images/p53.png")
Figure 10 (Extra!) Signature gene set p53 against all gene sets enriched: Post analysis coefficient of 0.25 Mann-Whitney cut-off (Two-sided) was used. Post-analysis results also show that p63 is mainly involved in reactive oxygen species (ROS) metabolism. ROS are known to be tumour growth promoters which then cause EMT in different human tissues as well as trigger cancer of different type.
The post analysis includes the signature gene set from Bader
gene sets (5) of Human symbols of Drugbank approved gene set. p53 is an
oncogene and it is involved in many pathways that are known to be
causing different types of cancers. P53 is associated with viral
pathways as well as some oncogenic pathways. Therefore, it is safe to
assume that the viral effect on oncogenic genes for triggering tumours
is indeed true.
knitr::include_graphics("/Users/mac/Desktop/BCB420/A3_MetyuMelkonyan/images/NANS.png")
Figure 11 NANS (encircled red) Sialic Acid pathway of Homo sapiens(11)
The sialic acid pathway is one of the pathways that is known to be affecting tumor proliferation, cell growth, cell detachment and as a result metastasis. It is known to be highly expressed by the cancer cell. Moreover, cancer-triggered apoptosis is one of the results of sialic acid. Neu5Ac 9-phosphate synthase (NANS)( encircled red) is an enzyme in sialic acid metabolism.NANS’ role in sialic acid pathways on the figure. Also, glycolipid-bound sialic acids are highly found in cancer cells. Finally, by having high expression levels and visualized on the pathway analysis NANS has the potential to be an oncogene. The role of sialic acid differs as there is an interaction between cancer genes and the viral genes, I do hypothesize that cancer causing viruses are an important factor in high sialic acid levels in pancreas cancer.
knitr::include_graphics("/Users/mac/Desktop/BCB420/A3_MetyuMelkonyan/images/sialicacid.jpg")
Figure 12 Oncogenic role of Sialic acid Metabolism (10)
1.Do the enrichment results support conclusions or mechanism discussed in the original paper? How do these results differ from the results you got from Assignment 2 thresholded methods ?
Yes, the enrichment results support the conclusion. It not only supports the conclusion of EMT genes that are known to be unregulated and found on the positive side of the gene enrichment analysis but also a finding of viral effect. This viral effect on different types of cancer has been discussed in different literature. However, the cancer oncogene p53 has a relationship with oncogenic viral cancer-inducing genes. Further research is required to illustrate both pancreas cancer and much other cancer. The results that we obtained from the thresholded analysis are partly different from what we obtained from the non-thresholded analysis. I am confident that this is due to the gene-set difference as well as experimental errors in data analysis. However, we were lucky to get partial positive results and conclude similarly to the main paper’s conclusion.
2.Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your result ?
In the third part of the assignment the post-analysis of the GSEA results allowed to see the direct interaction between different viral cancerous genes. This paves a path for further research in viral effect on the pancreas cancer, by strengthening the conclusion that the authors made. The results from g:Profile and the GSEA correlate in a way that both of these analysis do have similar numbers of genes lower than 1% p-value. This makes sense because both of these analysis indicate the conclusion of hidden viral genes that the author did not show in their analysis.
knitr::include_graphics("/Users/mac/Desktop/BCB420/A3_MetyuMelkonyan/images/cancer.jpeg")
Figure 13: Oncogenic role of Sialic acid Metabolism(9)